Executive Summary



Universal design is an approach to educational assessment based on principles of accessibility 
for a wide variety of end users. Thompson, Johnstone, and Thurlow described seven elements 
of universally designed assessments in their 2002 report entitled Universal Design Applied 
to Large Scale Assessments. Elements of universal design include inclusive test population; 
precisely defined constructs; accessible, non-biased items; tests that are amenable to accom- 
modations; simple, clear and intuitive procedures; maximum readability and comprehensibility; 
and maximum legibility. Since the 2002 report, Universal Design Project staff have examined 
research from a variety of fields in an effort to specify how elements of universally designed 
assessments can be put into practice. 

This report describes the development of a “considerations of universally designed assessments” 
form based on Thompson et al.’s original elements. Considerations are specific questions for test
designers to take into account while designing assessments. This report provides the original 
list of considerations from Thompson et al., then describes a validation process, whereby as- 
sessment and content area experts participated in a Delphi study. The Delphi study illuminated 
expert consensus on some considerations and disagreement on others. All expert commentary 
is captured in the text of this paper and in Appendix C (in tabular form), and a revised list of 
considerations is found in Appendix D. 

Based on the comprehensive work represented in this report, several recommendations are pre- 
sented for the use of the considerations of universal design at all stages of test development: 

1. Incorporate elements of universal design in the early stages of test development. 

2. Include disability, technology, and language acquisition experts in item reviews. 

3. Provide professional development for item developers and reviewers on use of the 
considerations for universal design. 

4. Present the items being reviewed in the format in which they will appear on the 
test. 

5. Include standards being tested with the items being reviewed. 

6. Try out items with students. 

7. Field test items in accommodated formats. 

8. Review computer-based items on computers. 
Introduction



The term universal design has been applied to a variety of educational approaches over the past 
several years. For instance, universal design for learning was first described by the Council 
for Exceptional Children (CEC) in a Research Connections article (CEC, 1999). Likewise, 
Thompson, Johnstone, and Thurlow (2002) of the National Center on Educational Outcomes 
(NCEO) described universal design approaches to large-scale assessment. In their initial paper 
on universal design of assessments, Thompson et al. outlined seven elements of universally 
designed assessments (inclusive assessment population; precisely defined constructs; acces- 
sible, non-biased items; amenable to accommodations; simple, clear and intuitive procedures; 
maximum readability and comprehensibility; and maximum legibility). Although elements of 
universal design provide guidance to states and assessment companies about design issues, there 
is still a need for specific information concerning what considerations should be made in test 
development in order to make tests accessible to a wide range of students. 

This report summarizes the process of developing and refining a list of considerations for the 
universal design of statewide assessments for all students, including students with disabilities 
and English language learners. The staff of the Universal Design Project at NCEO, working 
closely with experts in the fields of assessment, disability, content areas (reading and math), 
and language acquisition, completed this version of considerations in the summer of 2004. This 
revision was one of three, which followed the compilation of an initial set of considerations 
identified from a literature review of multiple content areas (see Thompson et al., 2002). The
first version included stakeholder input from the Council of Chief State School Officers (CCSSO) 
conference on large-scale assessment in 2003. Following CCSSO feedback, a second version 
(a Delphi review, see description later in the text) was developed by NCEO in partnership with 
the Minnesota Department of Education, with a primary focus on students with limited English 
proficiency. This report describes the process of refining the considerations during a third vali- 
dation study conducted by the Universal Design Project at NCEO. This is the third version of 
the considerations for use by test developers and item reviewers. This report also discusses the 
process used to validate the considerations, the issues that arise when using these considerations, 
and recommendations for use. 



Purpose of the Study 

The purpose of this report is to describe the process of developing and refining a set of consid- 
erations for item developers and item review teams to take into account in the universal design 
of inclusive, standardized, statewide assessments. Although the goal of this process was to 
find design strategies that maximize the accessibility of tests and test items, a larger goal was 
to create an instrument to guide careful consideration of the elements of test design in order to
discover issues in items that may be problematic. 



What is Universal Design? 

More than 20 years ago, Ron Mace, an architect who was a wheelchair user, began to actively 
promote a concept he termed “universal design.” Mace was adamant that his field did not need 
more special purpose designs that serve primarily to meet compliance codes and may also stig- 
matize people. Instead, he promoted design that works for most people, from the child who 
cannot turn a doorknob to the elderly woman who cannot climb stairs to get to a door (Mace, 
1998). 

The term universal design is found in the newly reauthorized Individuals with Disabilities 
Education Act of 2004 (Public Law No: 108-446). Specifically, IDEA of 2004 states that: 

The State educational agency (or, in the case of a districtwide assessment, the 
local educational agency) shall, to the extent feasible, use universal design 
principles in developing and administering any assessments under this paragraph 
(§ 612(a)(16)(E)).

Universal design is specifically defined in the U.S. Assistive Technology Act of 2004 (Public
Law No. 108-364; ATA 2004) as follows:

[A] concept or philosophy for designing and delivering products and services 
that are usable by people with the widest possible range of functional 
capabilities, which include products and services that are directly accessible 
(without requiring assistive technologies) and products and services that are 
interoperable with assistive technologies. 

Assessments that are universally designed are designed from the beginning, and continually 
refined, to allow participation of the widest possible range of students, resulting in more valid 
inferences about performance. These assessments are based on the premise that each child in 
school is a part of the population to be tested, and that test results should not be influenced by 
disability, gender, race, or English language ability. Universally designed assessments are not 
intended to eliminate individualization, but they may reduce the need for accommodations 
and various alternative assessments by eliminating access barriers associated with the tests 
themselves. 

The elements of universal design, according to Thompson et al., are: 
1. Inclusive assessment population

2. Precisely defined constructs 

3. Accessible, non-biased items 

4. Amenable to accommodations 

5. Simple, clear and intuitive procedures 

6. Maximum readability and comprehensibility 

7. Maximum legibility 

From these elements, Universal Design Project staff constructed considerations for universally designed
assessments. The considerations are a list of specific questions that help test designers locate 
potential design issues in items. The considerations are listed in Table 1. 



Table 1: Considerations for Universally Designed Assessment Items 



Does the item... 



Measure what it intends to measure 

• Reflect the intended content standards (reviewers have information about the content being 
measured) 

• Minimize skills required beyond those being measured 

Respect the diversity of the assessment population 

• Accessible to test takers (consider gender, age, ethnicity, socio-economic level) 

• Avoid content that might unfairly advantage or disadvantage any student subgroup 

Have clear format for text 

• Standard typeface 

• Twelve (12) point minimum for all print, including captions, footnotes, and graphs (type size 
appropriate for age group) 

• Wide spacing between letters, words, and lines 

• High contrast between color of text and background 

• Sufficient blank space (leading) between lines of text 

• Staggered right margins (no right justification) 

Have clear pictures and graphics (when essential to item) 

• Pictures are needed to respond to item 

• Pictures with clearly defined features 

• Dark lines (minimum use of gray scale and shading) 

• Sufficient contrast between colors 

• Color is not relied on to convey important information or distinctions 

• Pictures and graphs are labeled 

Have concise and readable text 

• Commonly used words 

• Vocabulary appropriate for grade level 

• Minimum use of unnecessary words 

• Idioms avoided unless idiomatic speech is being measured 

• Technical terms and abbreviations avoided (or defined) if not related to the content being measured 

• Sentence complexity is appropriate for grade level 

• Question to be answered is clearly identifiable 



Allow changes to its format without changing its meaning or difficulty (including visual or 
memory load) 

• Allows for the use of braille or other tactile format 

• Allows for signing to a student 

• Allows for the use of oral presentation to a student 

• Allows for the use of assistive technology 

• Allows for translation into another language 



Does the test... 



Have an overall appearance that is clean and organized 

• All images, pictures, and text provide information necessary to respond to the item 

• Information is organized in a manner consistent with an academic English framework with a left- 
right, top-bottom flow 



In addition to the other considerations, a computer-based test should have these 
considerations: 



Layout and design 

• Sufficient contrast between background and text and graphics for easy readability 

• Color is not relied on to convey important information or distinctions 

• Font size and color scheme can be easily modified (through browser settings, style sheets, or on- 
screen options) 

• Stimulus and response options are viewable on one screen when possible 

• Page layout is consistent throughout the test 

• Computer interfaces follow Section 508 guidelines 

Navigation 

• Navigation is clear and intuitive; it makes sense and is easy to figure out 

• Navigation and response selection is possible by mouse click or keyboard 

• Option to return to items and return to place in test after breaks 
Screen reader considerations 

• Item is intelligible when read by a text/screen reader 

• Links make sense when read out of visual context (“go to the next question” rather than “click here”) 

• Non-text elements have a text equivalent or description 

• Tables are only used to contain data, and make sense when read by screen reader 

Test specific options 

• Access to other functions is restricted (e.g., e-mail, Internet, instant messaging) 

• Pop up translations and definitions of key words/phrases are available if appropriate to the test 

• Students are able to record their responses and read them back (and have them read back using 
text-to-speech) as an alternative to a human scribe, but only if student has experiences with this 
mode of expression and chooses it for the test 

Computer capabilities 

• Adjustable volume 

• Speech recognition available (to convert user’s speech to text) 

• Test is compatible with current screen reader software 

• Computer-based option to mask items or text (e.g., split screen) 

• Computer software for test delivery is designed to be amenable to assistive technology 



Delphi 

We conducted a Delphi review to determine the usefulness of existing considerations for uni- 
versally designed assessments. The intent of the Delphi review was to invite experts in the fields 
of assessment, special education, academic content, and language acquisition to give input on
the considerations and modify them accordingly (Adler & Ziglio, 1996). The Delphi method 
is a structured process of using a series of questionnaires to gather the combined input from a 
group of persons with expertise related to a specific area or population. The method has been 
used in the social science and public health fields since the mid-1970s (Adler & Ziglio, 1996). 
Delphi studies allow participants to give their own informed opinion on an issue. The input is 
then compiled and returned to the participants who can respond to further questions, respond to 
the input from the other participants, and revise their own comments if desired. All iterations 
of Delphi are anonymous. 

This Delphi study took place entirely by e-mail. Participants were unaware of who was invited 
to participate in the study, who elected to participate, and the individuals who provided feedback 
(anonymity was maintained throughout the study). All suggestions and comments were given 
equal weight. 

Participants 

Universal Design Project research staff identified a group of experts to review the consider- 
ations for universally designed assessments. To ensure that important areas of expertise were 
represented, a chart was created and participants were recommended based on their expertise 
in one or more of the identified areas (see Table 2). These individuals were then invited to par- 
ticipate in the Delphi review before the first Delphi questionnaire was sent out. The resulting 
group of Delphi participants represented experts in the fields of assessment, assistive technology,
computer-based testing, reading, math, second language acquisition and testing, disability
consultation, and special education. 



Table 2: Expertise and Participants 



Expertise                                               Participant

Vision                                                  Barbara Henderson
Computer-based testing, learning disabilities           Gerald Tindal
Item analysis                                           Karen Barton
Second language acquisition and testing                 Margo Gottlieb
Second language acquisition, testing, and translation   Charles Stansfield
Physical disabilities                                   Sheryl Burgstahler
Hearing                                                 Carol Traxler
Science                                                 Scott Marion
Psychometrics                                           Tom Haladyna
Assistive technology                                    Tracy Hall
Math                                                    Marge Petit
Special education assessment                            Ken Olsen
State Assessment Director                               Tim Vansickle
Delphi Process



The first Delphi survey (Delphi Form 1; see Appendix A) was developed to obtain specific
feedback on the considerations draft presented by NCEO. Expert participants were provided 
ample opportunity to comment on the considerations or add to the list. The participants were 
asked first to rate the importance of each individual consideration on a five-point Likert scale.
They then were asked to comment on any of the considerations about which they felt strongly 
positive or negative. They could also pose questions on the form. Finally, they were asked to 
add any additional considerations and rate the importance of their additions. The participants 
were instructed to try to think about the considerations in terms of their usefulness for test 
developers and item reviewers. 

In July 2004, the first Delphi survey (Delphi Form 1) was e-mailed to the participants. Each 
participant was given seven days to review the considerations and e-mail comments back to 
NCEO. The comments and ratings were returned by 13 of 14 participants. These were compiled 
at NCEO and a second survey was developed (Delphi Form 2; see Appendix B).

The second survey (Delphi Form 2) included a list of anonymous individual ratings and the 
mean from all ratings assigned to each consideration. All comments made by the participants 
on the first form were included in the second form. Participants were asked to comment on 
results from the initial survey, were probed on specific issues by NCEO researchers, and were 
asked to comment on the 15 considerations suggested by participants (the majority relating to 
computer-based testing). The second survey was e-mailed out at the beginning of August 2004 
and participants were again given seven days to return their comments via e-mail. The comments 
were compiled by the staff at NCEO in mid-August 2004 (see Appendix C).
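
As an illustration of how the feedback summaries in the second survey (individual ratings plus their mean for each consideration) could be tabulated, the following Python sketch computes a range and a mean per consideration. The ratings are hypothetical placeholders, not data from this study, and the names `example_ratings` and `summarize` are assumptions made for the example.

```python
from statistics import mean

# Hypothetical 1-5 Likert ratings; None marks a reviewer who skipped the item.
# These numbers are placeholders, not ratings collected in the study.
example_ratings = {
    "Standard typeface": [3, 4, 4, 5, 4, 3, 5, 4, 4, 4, 5, 3, 4],
    "Question to be answered is clearly identifiable": [5, 5, 5, None, 5, 5],
}


def summarize(scores):
    """Return (range, mean) for one consideration, ignoring skipped ratings."""
    given = [s for s in scores if s is not None]
    return f"{min(given)}-{max(given)}", round(mean(given), 2)


for consideration, scores in example_ratings.items():
    score_range, average = summarize(scores)
    # One summary row per consideration: range, mean, then the wording.
    print(f"{score_range}  {average:.2f}  {consideration}")
```

Skipped ratings are dropped before averaging in this sketch, which is one possible way to handle reviewers who did not rate every consideration; the report does not state how missing ratings were treated.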



Response Rates 

The original list of considerations (Delphi Form 1) was sent out via e-mail to 14 experts for 
review. Thirteen of 14 (93%) experts returned Delphi Form 1. The second survey (Delphi Form 
2) was again sent out to the original 14 participants. The same thirteen participants returned 
the second survey (one participant did not participate in either survey). The feedback on both 
surveys was extensive. 
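
The response rate reported above is simple arithmetic, shown here as a short Python check. The function name `response_rate` is an assumption made for this illustration.

```python
def response_rate(returned, invited):
    """Percentage of invited experts who returned a survey, rounded to a whole percent."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    return round(100 * returned / invited)


# Thirteen of the 14 invited experts returned each Delphi form.
print(response_rate(13, 14))  # 93
```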



Results 

Using the feedback from both Delphi surveys, Universal Design Project staff revised the consid- 
erations for universally designed assessments (see Table 3). The considerations that had originally 
been sent to reviewers were rated as somewhat important to extremely important (from 2.67 to
5), with an average of very important (i.e., 4.3) to consider in designing and reviewing
assessments. One consideration was deleted based on expert feedback, while others were added or
revised. The primary addition was the expansion of the considerations for computer-based
testing. In addition, there were several additions to the discussion points for the
consideration note sections. All changes to the considerations are shown in Table 3, with
additions enclosed in {+ +} and deletions enclosed in [- -].



Table 3: Summary of Consideration Ratings and Changes



Range  Mean  Consideration

             Does the item...

             Measure what it intends to measure
5-5    5.00  • Reflect the intended content standards (reviewers have information about the
               content being measured)
3-5    4.33  • Minimize {+knowledge and+} skills required beyond [-those being-]
               {+what is intended for+} [-measured-] {+measurement+}

             Respect the diversity of the assessment population
4-5    4.75  • [-Accessible-] {+Sensitive+} to test takers {+characteristics and experiences+}
               (consider age, gender, ethnicity, and socio-economic level{+, region,
               disability, and language+})
4-5    4.64  • Avoid content that might unfairly advantage or disadvantage any student
               subgroup

             Have clear format for text
3-5    4.00  • Standard typeface
3-5    4.09  • Twelve (12) point minimum {+size+} for all print, including captions,
               footnotes, and graphs (type size appropriate for age group)
2-5    3.09  • Wide spacing between letters, words, and lines
3-5    4.09  • High contrast between color of text and background
2-5    2.82  • Sufficient blank space (leading) between lines of text
2-5    3.36  • Staggered right margins (no right justification)

             Have clear [-pictures and graphics-] {+visuals+} (when essential to item)
3-5    4.56  • [-Pictures-] {+Visuals+} are needed to [-respond to item-]
               {+answer the question+}
4-5    4.45  • [-Pictures-] {+Visuals+} with clearly defined features
               {+(minimum use of gray scale and shading)+}
3-5    3.82  • [-Dark lines (minimum use of gray scale and shading)-]
1-5    3.64  • Sufficient contrast between colors
2-5    3.91  • Color {+alone+} is not relied on to convey important information or
               distinctions
3-5    3.91  • [-Pictures and graphs-] {+Visuals+} are labeled

             Have concise and readable text
1-5    4.18  • Commonly used words {+(except vocabulary being tested)+}
4-5    4.83  • Vocabulary appropriate for grade level
1-5    4.17  • Minimum use of unnecessary words
3-5    4.67  • Idioms avoided unless idiomatic speech is being measured
4-5    4.73  • Technical terms and abbreviations avoided (or defined) if not related to the
               content being measured
1-5    4.45  • Sentence complexity is appropriate for grade level
5-5    5.00  • Question to be answered is clearly identifiable

             Allow changes to its format without changing its meaning or difficulty
             (including visual or memory load)
3-5    4.67  • Allows for the use of braille or other tactile format
3-5    4.55  • Allows for signing to a student
3-5    4.36  • Allows for the use of oral presentation to a student
3-5    4.45  • Allows for the use of assistive technology
1-5    3.64  • Allows for translation into another language

             Does the test...

             Have an overall appearance that is clean and organized
3-5    4.50  • All {+visuals (e.g.,+} images, pictures{+)+} and text provide information
               necessary to respond to the item
4-5    4.33  • Information is organized in a manner consistent with an academic English
               framework with a left-right, top-bottom flow
0-5    4.00  • {+Booklets/materials can be easily handled with limited motor coordination+}
               (consideration was added)
0-5    3.43  • {+Response formats are easily+} [-correlated-] {+matched to question+}
0-5    3.82  • {+Place for student to take notes (on the screen for CBT) or extra white space
               with paper-pencil+}

             In addition to the other considerations, a computer-based test should have
             these considerations:

             Layout and design
4-5    4.67  • Sufficient contrast between background and text and graphics for easy
               readability
2-5    3.92  • Color {+alone+} is not relied on to convey important information or
               distinctions
2-5    4.08  • Font size and color scheme can be easily modified (through browser settings,
               style sheets, or on-screen options)
3-5    4.67  • Stimulus and response options are viewable on one screen when possible
4-5    4.75  • Page layout is consistent throughout the test
0-5    3.56  • Computer interfaces follow Section 508 guidelines {+(www.section508.gov)+}

             Navigation
0-5    4.46  • {+Students have received adequate training on use of test delivery system+}
4-5    4.92  • Navigation is clear and intuitive; it makes sense and is easy to figure out
3-5    4.67  • Navigation and response selection is possible by mouse click or keyboard
3-5    4.60  • Option to return to items and return to place in test after breaks

             Screen reader considerations
3-5    4.58  • Item is intelligible when read by a text/screen reader
4-5    4.67  • Links make sense when read out of visual context (“go to the next question”
               rather than “click here”)
3-5    4.30  • Non-text elements have a text equivalent or description
3-5    4.36  • Tables are only used to contain data, and make sense when read by screen
               reader

             Test specific options
3-5    4.55  • Access to other functions is restricted (e.g., e-mail, Internet, instant
               messaging)
3-5    4.08  • Pop up translations and definitions of key words/phrases are available if
               appropriate to the test
0-5    2.67  • {+Students writing online can get feedback on length of writing on demand in
               cases where there is a restriction on number of words+}
0-5    3.69  • Students are able to record their responses and read them back
               [-(or have them read back using text-to-speech) as alternative to human
               scribe, but only if student has experiences with this mode of expression and
               chooses it for the test-] {+as an alternative to human scribe+}
0-5    4.17  • {+Students are allowed to create persistent marks to the extent that they are
               already allowed on paper-based booklets (e.g., marking items for review,
               eliminating multiple choice items, etc.)+}

             Computer capabilities
3-5    4.50  • Adjustable volume
1-5    3.67  • Speech recognition available (to convert user’s speech to text)
3-5    4.25  • Test is compatible with current screen reader software
0-4    3.00  • Computer-based option to mask items or text (e.g., split screen)
0-5    3.91  • Computer software for test delivery is designed to be amenable to assistive
               technology



Notes that were added to the considerations address some of the anticipated issues that might 
arise when using the considerations. While we tried to keep the list of considerations brief and 
user-friendly, it was clear from participant comments that more explanation about the intent and 
issues surrounding the considerations needed to be presented close to the considerations in note 
form. The notes are not meant to be used as definitive judgment of the “good” or “bad” quality 
of an item or design feature. Instead, the notes are intended to add clarity to the considerations, 
help elucidate important issues, and help generate discussion. 



Discussions About Selected Considerations 

In addition to providing greater clarity to several of the considerations, many of the respondents 
in the Delphi review pointed out that using some of the considerations depended on the content 
being tested. Extensive discussion focused on issues of construct vs. content validity and the 
minimization of construct-irrelevant variance. There was also extensive discussion on the va- 
lidity and practicality of the translation of assessments to languages other than English. In this 
section of the report, we present a detailed review of these discussions. Considerations about 
which few comments were made and no clarification was deemed necessary are not discussed. 
Responses to all considerations, however, can be found in Appendix C. 
Consideration: “Reflects the intended content standards (reviewers have information about
the content being measured).”

Following a discussion by Universal Design Project staff, Delphi participants were asked to 
comment on whether the first consideration should remain “Reflects the intended content stan- 
dards (reviewers have information about the content being measured)” or whether it should 
be reworded “Reflects the intended construct (reviewers have information about the construct 
being measured).” Although opinions leaned toward changing the wording (Yes = 6, No = 3, 
Combination wording = 1, Did not state position but provided information to consider when 
making the decision = 2, Don’t know = 1), only two of the participants in favor of using the term 
“construct” provided reasoning. One suggested that construct “would fit better with the profes- 
sional terminology,” while the other stated that “content is topical, constructs are conceptual. 
This difference in meaning is huge. Furthermore, construct is a term used in APA standards and 
is deeper than content.” 

The participants who wished the consideration to remain the same provided critical informa- 
tion about what to think about before a decision could be made. Specifically, one participant 
suggested that we consider our audience: “Construct is a formal term that theorists use. Content 
standards [are] what practitioners understand.” Another participant suggested we consider what 
the terms imply: “...construct is a sort of overarching concept (i.e., reading) whereas content 
standards are... narrower (e.g., reproduces capital letters)... If the test is supposed to be a
standards-based achievement test, then it must address standards. If not, then the item need only
address the construct.” 

Ultimately, Universal Design Project staff decided to retain the term “content.” This term ap- 
pears to be consistent with the link of items to standards, and avoids the apparent confusion 
surrounding the term “construct.” It should be noted, however, that the term “construct” may 
still be useful, especially if item developers (who are familiar with the concept of constructs) 
are using these considerations. 

Consideration: “Minimize knowledge and skills required beyond what is intended for
measurement.”

The second consideration under review was altered slightly based on participant input. Initially, 
this consideration stated, “Minimize skills required beyond those being measured.” This was 
changed to “Minimizes knowledge and skills required beyond what is intended for measurement” 
following several suggested alternate phrases. In addition to suggestions on phrasing, Delphi 
participants expressed concern that item writers or reviewers might interpret this consideration 
in such a way as to “...separate skills too much... [and thus run the risk that] we’ll wind up with
tests that measure isolated, basic skills.” Still others expressed the belief that this consideration 
has direct relevance for the measurement of “higher level thinking.” Yet, as another reviewer 
questioned, “how... the other skills (are) defined and targeted” would be important in guiding
item writers and reviewers. One participant summed up the issue by saying that it “...depends
on how discrete the standards are; minimal skills can be embedded in more complex contextual- 
ized items. Ultimately, it depends on what you are measuring.” 

Consideration: “Sensitive to test taker characteristics and experiences (consider
age, gender, ethnicity, socio-economic level, region, disability, and language).”

The third consideration was changed from “Accessible to test takers (consider age, gender, 
ethnicity, and socio-economic level)” to “Sensitive to test taker characteristics and experiences
(consider gender, age, ethnicity, socio-economic level, region, disability, and language).” When 
asked about including the term “bias” in this consideration, participants were somewhat divided. 
While some indicated that bias should be included to “reference systematic variance that inter- 
feres with making a valid inference,” others clarified that “bias and accessibility are separate 
issues from a review standpoint, though obviously related.” Keeping participants’ suggestions 
and reasoning in mind, it was decided that the term “bias” would be included in the note portion 
of the consideration and that the demographic variables would be expanded from four to seven, 
reflecting the need for greater sensitivity to the experiences of very diverse populations. 

Consideration: “Standard typeface.”

When considering the clarity of the format for text in assessments, most participants agreed 
that a standard typeface was important. There was, however, confusion about the meaning of 
“standard.” Some Delphi participants had interpreted this consideration as implying that a single 
standard font existed, as illustrated in the following comment: “There is no standard typeface, 
thus the myriad fonts used in various publisher’s files, even within the same text or textbook.” 
To reduce confusion over the meaning of the term, it was determined that additional
clarification was needed. Consequently, the following was added to the note section:
“Use clear, common, familiar, and consistent fonts,” followed by examples of fonts.

Consideration: “Twelve (12) point minimum size for all print, including captions, footnotes,
and graphs (type size appropriate for age group).”

When considering which font size to select, several Delphi participants noted the importance of 
considering the font style. Given the fact that a 12-point font can vary in size depending upon 
the font style, an additional issue was included in the note section. As suggested, one consider- 
ation (width of spacing between letters) was combined with font. One participant stated “Wide 
spacing is not necessarily best; proper font selection is more important.” Consequently, this 
consideration was added to the note section of the consideration addressing font. 

Consideration: “High contrast between color of text and background.”

When considering the use of color in text or background, participants suggested going beyond 
the issue of contrast to consider print density. Specifically, one participant stated, “[E]ven with 
sufficient color contrast, color blind users may not be able to distinguish text and background. [I]
suggest you further recommend high print density contrast. This would also avoid isoluminance 
effects for non-visually-impaired students.” (“Isoluminance” is the point at which two colors 
have an equivalent luminous intensity, or brightness.) Based on these comments, information 
on print density and isoluminance was added to the note section for the consideration address- 
ing format for text. 
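
One way to make “print density contrast” concrete is to compare the luminance of the text and background colors rather than their hues. The sketch below is illustrative only: it assumes colors are given as 8-bit sRGB values and uses the relative luminance and contrast ratio formulas published in the W3C Web Content Accessibility Guidelines (WCAG) 2.x, which this report does not itself cite. The function names and sample colors are hypothetical.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color: 0.0 is black, 1.0 is white."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """Luminance contrast ratio: 1.0 means isoluminant; black on white is about 21."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical sample colors, chosen only to illustrate the two extremes.
black, white = (0, 0, 0), (255, 255, 255)
red, mid_gray = (255, 0, 0), (127, 127, 127)

print(round(contrast_ratio(black, white), 2))   # high luminance contrast
print(round(contrast_ratio(red, mid_gray), 2))  # distinct hues, nearly isoluminant
```

In this sketch, red text on a mid-gray background differs clearly in hue yet yields a ratio close to 1, which is the isoluminance problem the participant described. Any cutoff a review team applies to such a ratio would be its own decision; none is specified in this report.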

Consideration: “Visuals are needed to answer the question.”

The use of visuals resulted in considerable discussion ranging from issues surrounding limiting 
visuals, the use of visuals to provide only redundant information, and the benefits/drawbacks of 
using visuals in relation to specific disabilities. In relation to the content of visuals, for example, 
it was suggested, “Pictures, line art, etc. should be related to the item [and] should enhance un- 
derstanding, [but] not [be] required for understanding, with the exception of data tables like on 
math and science tests.” Additionally, another Delphi participant stated, “often there are pictures 
used that are not redundant with the text but that are relevant to the item and to the construct.” 
Consequently, it was suggested that the wording of this consideration take this idea into account. 
Rather than dramatically change the wording of this consideration, qualifying information was
added to the note portion below the consideration, addressing the idea that clear and
well-designed graphics or pictures should add value for students who need a visual cue.

Consideration: “Commonly used words (except vocabulary being tested).”

When considering the vocabulary used in assessments, both for directions and specific items, 
many Delphi participants commented on the need for greater clarity surrounding the specifica- 
tion that the text be comprised of “commonly used words.” Several participants suggested that 
the term “age-appropriate” was preferable, while another suggested adding “concise and read- 
able.” Ultimately, the greatest concern with this particular consideration was that there be some 
acknowledgement that the words selected should be common, “with the exception of subject 
specific terminology...” In other words, the “item should consist of commonly understood 
words or vocabulary...” except when knowledge of specific vocabulary is being tested. One 
participant also suggested that the vocabulary be “. . .consistent with each specific grade level,” 
with another suggesting “at or below grade level [when] reading is not the primary construct 
tested.” As a result of this feedback, additional clarification was added both to the wording of
the consideration (it was changed from “Commonly used words” to “Commonly used words
(except vocabulary being tested)”) and to the note section following the consideration.

Consideration: “Allows for translation into another language.” 

Perhaps the most controversial consideration of all was “Allows for translation into another lan- 
guage.” One and one-half pages of initial comments, questions, and suggestions were followed 
by an additional one and one-half pages of responses, comments, questions, and suggestions. 
The response of one participant summarized a number of the issues that participants grappled
with when determining the appropriateness of this consideration: 

“This is a questionable and highly controversial issue, particularly when one realizes 
that such a standard is impossible to meet. About 72% of our LEP students are Spanish 
speakers, but the other 28% represent many diverse languages. How do we accom- 
modate and what is the theoretical rationale and what is the technology for doing this?
Is it possible? Is it beneficial?”

In reference to the impracticality of translating tests into the less commonly represented lan- 
guage groups, some participants questioned the fairness of accommodating some students (e.g., 
Spanish speakers) and denying others. Another stated “What harm is done by helping the 72% 
of LEP students who speak Spanish? We provide accommodations to others where possible, 
but some would propose that a translated test is harmful. Poppycock!” 

Participants also disagreed about the quality of the translations and the skill
of the translators. A primary problem with translation, however, was clear: “The limitation is
money. Translation must be cost effective like everything else in education. You can’t provide 
translated tests for very small numbers. The Lau decision (Lau v. Nichols, 1974) and other civil
rights decisions make it clear that numbers dictate expectations of school systems.” Given the 
cost, customized dictionaries were suggested as a possible alternative to fully translated tests. 

Besides the practicality/impracticality of translating tests, one area of considerable debate sur- 
rounded the validity of the inferences that can be made from scores derived from translated 
tests. Some participants expressed the belief that translated tests reduced the validity of scores 
(“Data analysis has shown these to be less than valid measures of student performance.”), or that 
certain translations would result in less valid scores (“Some critical and relevant word/concepts 
[do] not translate into every language.”). Others, however, made the argument that there are 
few instances where concepts do not translate: 

“Minnesota translates to Hmong and Somali. Only in these languages are there rel- 
evant words/concepts that do not translate easily into English. The other languages 
of state assessment (Spanish, Russian, Chinese, Korean, Haitian Creole) almost never 
pose a problem for translating words or concepts. Professional translators will tell you 
they can translate almost any word or idea, and if they encounter one they can’t, they 
will tell you that too.” 

Another participant added, “Translation is no more a threat to validity than a change in option 
order or a change in font. Such changes might generate a miniscule change in item difficulty, 
but they don’t affect validity... [Translation] is the exact same test stated in a different language.” 
Yet others brought up the issue of validity in reference to a specific construct being measured. 
For example, two participants stated that translating English language arts (ELA) tests would
invalidate the inferences that could be made from the scores. In light of NCLB legislation, a 
participant brought up a final important point of consideration: “A translated test is always much 
less of a threat to validity and score comparability than an alternate assessment,” suggesting that 
a translated test is preferable to alternate assessment measures for English language learners. 

Two reviewers suggested that this consideration be eliminated given the controversy, at least 
until more research was available. Ultimately, Universal Design Project research staff decided 
to retain this consideration, acknowledging the issues item writers and reviewers face as they 
incorporate this consideration into the test construction/revision process. This information was 
included in the note section following the consideration. 



Summary of Revisions 

At the completion of the study, the Universal Design Project staff revised the original consid- 
erations based on Delphi responses (Appendix D). The most extensive revisions were made 
to the content and wording of the considerations. Some of the most significant changes to the 
considerations that resulted from the Delphi process are described here: 

1. Wording of several of the considerations was revised using feedback from the Delphi
review participants. For example, “Minimize skills required beyond those being 
measured” was changed to “Minimize knowledge and skills required beyond what 
is intended for measurement” and “Accessible to test takers (consider age, gender, 
ethnicity, and socio-economic level)” was changed and expanded to “Sensitive to test 
taker characteristics and experiences (consider gender, age, ethnicity, socio-economic 
level, region, disability, and language).” 

2. Computer-based testing considerations were expanded. Much of the useful feedback for 
this section came from reviewers who are familiar with the development of computer- 
based tests. With these revisions, the section of considerations for computer-based 
testing was clarified and redundancies with other considerations were eliminated. 

3. Notes were added to the considerations. These notes discuss some of the anticipated 
issues that might arise when using the considerations. While we tried to keep the 
list of considerations brief and user-friendly, it was clear that more explanation 
about the intent and issues surrounding the considerations needed to be presented on 
the same page. The notes are intended to add clarity to the considerations and help 
elucidate important issues. Notes also provide evidence of the complexity of some 
of the considerations and illustrate that considerations are not static rules, but general
principles that aid in flagging potentially problematic items. 

4. One font-dependent consideration (“Wide spacing between letters, words, and lines”) 
was eliminated. Instead it was included in the note section for “Have a clear format 
for text.” 

5. Relevant research citations were added to the considerations so that people wanting 
to investigate a certain issue in more depth would have the resource citations at hand 
(see Appendix E). 

6. We created a review checklist of the considerations for item reviewers and developers 
(see Appendix F). This form is intended to be used by item reviewers and developers 
who have received training on the considerations. It consists of a list of the 
considerations, without the supporting text. Using this form, item reviewers and 
developers can go through items and flag for further discussion areas of concern or 
alteration. For item reviewers, there is an additional form on which comments may 
be recorded explaining why some aspect of an item was flagged (Appendix G). 



Issues Related to Universal Design 

One of the most important outcomes of this review process was the identification of issues that 
surround the development of universally designed assessments. These issues highlight the 
complexities of a process without easy answers. The issues discussed in this section are not 
meant to be an exhaustive list of the challenges related to the universal design of assessments, 
but instead provide some guidance about the challenges that might be encountered when using 
the considerations. 

1. Universal design is not a cure-all. Just because a test is universally designed, or has
used the elements of universal design to guide its development, does not mean that 
a test is accessible to all students. The considerations recommended in this report 
are just that, considerations. They are meant to be used to guide test developers 
and reviewers in creating tests that are accessible to the greatest number of students 
possible. However, some changes that make a test more accessible to
one group of students might actually make it less accessible to another group. For
example, eliminating or altering an illustration accompanying an authentic reading text 
may clarify an item by removing a distraction for some students. On the other hand, 
eliminating it may remove or change some useful context for the passage. Issues of 
accessibility need to be carefully considered and discussed openly so that informed 
decisions can be made without hindering the construct being tested. Universal design
can be a useful tool for developing better assessments, but it is not a tool that can 
magically make all tests accessible to all students. 

2. Universal design does not replace accommodations. While universal design may 
remove some barriers for students with disabilities and English language learners, it 
in no way eliminates the need for testing accommodations. Some students may still 
need accommodations such as large print or assistive technology. A goal of universally 
designed assessments is to anticipate common accommodations and design tests that 
allow accommodations to be more easily integrated into the format of the test. 

3. Universal design does not replace good instruction. The goal of universal design 
is to think about the full range of students taking an assessment so that they all can 
demonstrate what they have learned. A student who has not had an opportunity to 
learn the material tested will not be helped by a universally designed test.

4. Universal design does not lower standards. Some may perceive a universally 
designed assessment to be a “watered-down” or “easier” assessment. It is important 
to make clear that the purpose of universal design is to make sure that the content being
tested is more universally accessible to all of the students taking the test and thus a 
better measure of student learning. 

5. Technology use is challenging. The quality of technology available across schools 
is an important issue when creating a computer-based assessment. It is difficult to 
anticipate what accessibility issues will arise when a test is administered on a variety 
of different systems with a variety of assistive technologies. Trying to anticipate these 
issues is important, however, when reviewing items. 



Recommendations 

These considerations can be used to make assessments more universally accessible to the entire 
population of test takers. Here are some specific recommendations for the use of the consider- 
ations of universal design at all stages of test development. 

1. Incorporate elements of universal design in the early stages of test development.

Universally designed assessments present an opportunity to bring more people to the 
table in the early stages of test development including experts in disability, language 
acquisition, and technology. These experts are able to give more structured input at 
different stages of the test development process if they understand universal design 
and have these considerations for item development and review at hand. It is more
cost effective to consider universal design in the early stages of item development, 
rather than at the end when items have already been developed and field-tested. 

2. Include disability, technology, and language acquisition experts in item reviews. 

Every effort should be made to involve experts in item review who can judge whether 
items meet all of the considerations. 

3. Provide professional development for item developers and reviewers on use 
of the considerations for universal design. Explanation and discussion of each 
consideration will ensure use by item developers and reviewers. 

4. Present the items being reviewed in the format in which they will appear on 
the test. When item reviewers examine items to be included in an assessment, it is 
important to format items as closely as possible to how they will appear on the test. 
Since many of the considerations have to do with format, it is not useful to look at 
items that are not in the font, size, or format in which they will appear in the actual 
test booklet. 

5. Include standards being tested with the items being reviewed. Above all other
considerations, the first consideration (does the item measure what it intends to
measure) is of primary importance in constructing universally designed assessments.
Consequently, item review teams using the considerations of universal design to guide 
their work must have the standard (grade level expectations) that each item is intended 
to test at hand. It is only by knowing what an item is intended to test that reviewers can 
judge whether an element of the item might interfere with student access. Each item 
needs to be presented with the corresponding standard being tested in that item. 

6. Try out items with students. Some of the elements of an item that distract or confuse 
students are not easily recognizable by adults or native English speakers. For this 
reason, trying items out with students by conducting think-aloud studies can provide 
valuable information about whether an item is testing the content intended (Thompson, 
Johnstone, & Miller, in press). 

7. Field test items in accommodated formats. In order to ensure that the content an 
item is intended to measure is not being changed when an accommodated format of 
a test is being used, include students using accommodated test formats in field tests. 
While this can add expense to the field test, there are ways of doing such
studies that can progressively build a database. For example, a field test could focus 
on the use of certain accommodated formats one year and others the next, building 
up a database for the various forms of the test. Again, qualitative data from student 
interviews in this area can provide important information that can be used to improve
items. 

8. Review computer-based items on computers. To judge whether computer-based 
items are universally designed, item reviewers need to use the technology that will 
be used to deliver the test. Using a paper print-out of an assessment does not allow a 
review team to meaningfully consider the format of the test. 



Conclusion 

We hope that the process detailed in this report has not only produced a better set of
considerations of universally designed assessments for all students, but has also clarified some of
the opportunities and challenges that universally designed assessments present. While using
universal design does not guarantee the accessibility of any test to all students, using the con- 
siderations to openly discuss issues of test design throughout the test development process can 
make any assessment more inclusive. Making the process of test development more transparent, 
informed, and focused on the needs of the entire population of students will help ensure that the 
assessment results are more meaningful for the widest range of students. 
Appendix A

Delphi Review of Test Item Considerations (Form 1) 



Rating scale for importance: 

5=Extremely important to consider; 4=Very important to consider; 3=Important to consider;
2=Somewhat important to consider; 1=Not important to consider.

Scales adapted from Ziglio (1996). 



Considerations when reviewing any test item: | Subject Responses | Mean | Please insert your comments here...
Does the item... 


Measure what it intends to 
measure 

• Reflects the intended content 
standards (reviewers have 
information about the content 
being measured) 

• Minimize skills required beyond 
those being measured 








Respect the diversity of the 
assessment population 

• Accessible to test takers 
(consider age, gender, ethnicity, 
and socio-economic level) 

• Avoids content that might 
unfairly advantage or 
disadvantage any student 
subgroup 








Have a clear format for text 

• Standard typeface 

• Type size appropriate for age 
group (12 point minimum for 
all print, including captions, 
footnotes, and graphs) 

• Wide spacing between letters, 
words, and lines 

• High contrast between color of 
text and background 

• Sufficient leading (blank space) 
between lines of text 

• Staggered right margins (no 
right justification) 

Have clear pictures and graphics (when essential to item)

• Pictures are needed to respond 
to item 

• Pictures with clearly defined 
features 

• Dark lines (minimum use of 
gray scale and shading) 

• Sufficient contrast between 
colors 

• Color is not relied on to convey 
important information or 
distinctions 

• Pictures and graphs are labeled 








Have concise and readable text 

• Commonly used words 

• Vocabulary appropriate for 
grade level 

• Minimum use of unnecessary 
words 

• Idioms avoided unless idiomatic 
speech is being measured 

• Technical terms and 
abbreviations avoided (or 
defined) if not related to the 
content being measured 

• Sentence complexity is 
appropriate for grade level 

• Question to be answered is 
clearly identifiable 








Allow changes to its format 
without changing its meaning 
or difficulty (including visual or 
memory load) 

• Allows for the use of braille or 
other tactile format 

• Allows for signing to a student 

• Allows for the use of oral 
presentation to a student 

• Allows for the use of assistive 
technology 

• Allows for translation into 
another language 
Does the test...



Have an overall appearance that 
is clean and organized 

• All images, pictures, and text 
provide information necessary 
to respond to the item 

• Information is organized in a 
manner that is consistent with 
an academic English framework 
with a left-right, top-bottom flow 








In addition to the other considerations, a computer-based test should have these 
considerations: 


Layout and design 

• Sufficient contrast between 
background and text and 
graphics for easy readability 

• Color is not relied on to convey 
important information or 
distinctions 

• Font size and color scheme 
can be easily modified (through 
browser settings, style sheets, 
or on-screen options) 

• Stimulus and response options 
are viewable on one screen 
when possible 

• Page layout is consistent 
throughout the test 








Navigation 

• Navigation is clear and intuitive; 
it makes sense and is easy to 
figure out 

• Navigation and response 
selection is possible by mouse 
click or keyboard 

• Option to return to items and 
return to place in test after 
breaks 








Screen reader considerations 

• Item is intelligible when read by 
a text/screen reader 

• Links make sense when read
out of visual context (“go to the
next question” rather than “click
here”)

• Non-text elements have a text 
equivalent or description 
• Tables are only used to contain
data and make sense when 
read by screen reader 








Test specific options 

• Access to other functions is 
restricted (e.g., e-mail, Internet,
instant messaging) 

• Pop up translations and 
definitions of key words/phrases 
are available if appropriate to 
the test 








Computer capabilities 

• Adjustable volume 

• Speech recognition available (to 
convert user’s speech to text) 

• Test is compatible with current 
screen reader software 









Items on this form are based on information presented in Thompson, Johnstone, & Thurlow (2002,
Universal Design Applied to Large Scale Assessments, Synthesis Report 44); Thompson & Thurlow
(2002, Universally Designed Assessments: Better Tests for Everyone!, Policy Directions 14); and Kopriva
(2002, Ensuring Accuracy in Testing for English Language Learners, CCSSO SCASS-LEP Consortium),
as well as from NCEO staff brainstorming, input received from participants in the Universal Design
Pre-conference Clinic at the CCSSO Large Scale Assessment and Accountability Conference in San
Antonio, Texas, June 2003, and input from a joint project/Delphi review with the Minnesota, Nevada, and
South Carolina Departments of Education.
Appendix B

Delphi Review of Test Item Considerations (Form 2) 
Layout and design

• Sufficient contrast between background and text and graphics for easy readability
Subject responses: 4 4 4 4 5 5 5 5 5 5 5 5
Mean: 4.67
Comments: 1 - “Sufficient luminance contrast between...”
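
The mean reported for this consideration can be reproduced from the twelve individual ratings. The following is a minimal sketch, assuming each character in the response string is one participant's rating on the 1 to 5 importance scale; the function name is hypothetical and is not part of the original study materials.

```python
from statistics import mean


def summarize_ratings(rating_string):
    """Return the number of ratings and their mean, rounded to two decimals."""
    ratings = [int(ch) for ch in rating_string]
    # The importance scale on the Delphi form runs from 1 to 5.
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return {"n": len(ratings), "mean": round(mean(ratings), 2)}


# Ratings as printed in the row above: four 4s followed by eight 5s.
summary = summarize_ratings("444455555555")
print(summary)  # {'n': 12, 'mean': 4.67}
```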
Appendix C

Original Considerations Plus All Expert Commentary 
Delphi Review of Test Item Considerations 
In addition to the other considerations, a computer-based test should have these considerations:
Appendix D

Revised Considerations Based on Delphi Results 



Considerations for Universally Designed Assessment Items 

These guidelines contain suggestions that test developers, item reviewers, and others working 
on the development of assessments should consider at the beginning stages of designing an as- 
sessment that is accessible to the widest range of students possible. Unless stated otherwise, all 
considerations apply to both paper/pencil and computer-based assessments. 



Considerations when reviewing any test item: 



Does the item... 

Measure what it intends to measure 

• Reflect the intended content standards (reviewers have information about the content being 
measured). 

• Minimize knowledge and skills required beyond what is intended for measurement. 

Notes: 

a. Content area assessments must be aligned to grade level state academic content standards 
(grade level expectations). 

b. Information about the content standard(s) assessed by each item must be supplied to reviewers. 

c. Careful consideration of the way content standards are phrased is important in determining what 
knowledge and skills involved in responding to an item are extraneous and which are relevant to 
what is being tested. 

d. When considering what is being measured there is somewhat of a “balancing act.” In certain 
types of test items additional skills may be necessary. For example, responses to a listening 
test must be spoken or written, requiring skills in at least one modality in addition to listening. A 
similar issue is presented on math tests that require skills in reading. 

• While it is important to minimize knowledge and skills beyond what is intended for 
measurement, it cannot take precedence over the ability to measure all content areas to be 
assessed (e.g., drawing a graph of the results). 

• When measuring skills such as higher-order processing skills, it is difficult to focus the 
assessment items only on the explicitly targeted content areas. 
Does the item...

Respect the diversity of the assessment population 

• Sensitive to test taker characteristics and experiences (consider gender, age, ethnicity, socio- 
economic level, region, disability, and language) 

• Avoid content that might unfairly advantage or disadvantage any student subgroup 

Notes: 

a. Avoid bias toward or against any group of students that may cause them to have difficulty 
responding to items or create emotional stress. 

b. Carefully evaluate what assumptions items make about shared experiences. 

c. Tests should strive to avoid content that negatively depicts any student subgroup and avoid 
content that potentially provokes a negative reaction in any student subgroup. 

d. Gender, etc., should not be a barrier to understanding the task an item requires. 

e. It is important to recognize that every test is biased, although universal design serves to 
minimize its impact. For example, English language learners are a subset of language minority 
students. 

f. In an effort to make assessments more accessible, item writers and developers need to guard 
against stripping down assessments too much. 

Does the item... 

Have a clear format for text 

• Standard typeface. 

• Twelve (12) point minimum size for all print, including captions, footnotes, and graphs (type size 
appropriate for age group), and adaptable font size for computers. 

• High contrast between color of text and background. 

• Sufficient blank space (leading) between lines of text. 

• Staggered right margins (no right justification). 

Notes: 

a. Use clear, common, familiar, and consistent fonts (e.g., fonts with wide spacing between letters, 
words, and lines such as Times or Arial). 

b. Avoid decorations and flourishes. 

c. The term “blank space” rather than “white space” may be more accurate because the 
background is not always white. 

d. A student should never be confused over typeface.

e. Twelve-point varies with font types, as does spacing between letters. 

f. Some readers may be unable to see more typical print sizes clearly (e.g., 12 point). 

g. When selecting color in text or background, consider high print density contrast in order to avoid
isoluminance (i.e., colors appearing equivalent for students with color blindness).

Does the item... 

Have clear visuals (when essential to item) 

• Visuals are needed to answer the question. 

• Visuals with clearly defined features (minimum use of gray scale and shading). 

• Sufficient contrast between colors. 

• Color alone is not relied on to convey important information or distinctions.

• Visuals are labeled. 

Notes: 

a. Pictures should have a purpose other than simply to be decorative. 

b. Weigh whether the use of a visual (e.g., illustration) helps students or interferes with the content 
being tested. This is a judgment call, but should be carefully considered. 
c. Labeling pictures is helpful, when possible. This is true even if the picture seems obvious.

d. Give additional clues besides color when possible (e.g., use text label “stop” and “go” next to 
stop light images). 

e. Visuals should not distract or affect students who do not need the visual cue; rather they should 
add value for students who do need visual cues. 

f. Pictures, when used, should support text and provide an additional resource for students to 
construct meaning. 

g. There may be instances where grayscale and shading are appropriate for providing relevant 
information to students. 

Does the item... 

Have concise and readable text 

• Commonly used words (except vocabulary being tested). 

• Vocabulary appropriate for grade level. 

• Minimum use of unnecessary words. 

• Idioms avoided unless idiomatic speech is being measured. 

• Technical terms and abbreviations avoided (or defined) if not related to the content being 
measured. 

• Sentence complexity is appropriate for grade level. 

• Question to be answered is clearly identifiable. 

Notes: 

a. The use of common words depends on the content assessed. For example, if vocabulary is 
being tested, difficult or uncommon words might be appropriate to include. 

b. Use of commonly used words does not assume that the words are simple. They may carry 
multiple meanings. 

c. Commonly used words may be replaced by reference words most frequently appearing in similar 
text or listed as the primary definition in the dictionary. 

d. Some students may know less common words but may not know phrasal words. It is difficult to 
assume what is uncommon or difficult. 

e. Other than the terms relevant to the construct being measured, items should use basic language 
and vocabulary (as they are different), and might even be one grade level below the grade being 
tested. 

f. If reading is not the primary construct tested, keep reading level at or below grade level in order 
to minimize construct irrelevant variance. 

g. With the exception of subject specific terminology, the text of an item should consist of 
commonly understood words or vocabulary consistent with each specific grade level. 

h. Determination of complexity can include many factors such as use of clauses, use of the 
passive, number of syllables in a word, length of sentences, length of single passage, combined 
length of all reading passages, amount of extraneous text involved in non-reading problems, etc. 
Complexity needs to be considered on all tests. 

i. When using authentic texts on reading passages, complexity may be difficult to control, but 
should at least be considered on test questions. 
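
Several of the complexity factors listed in note h (sentence length, syllables per word) can be estimated mechanically as a first screening pass. The following is a minimal sketch and not part of the original considerations: the function names are hypothetical, the syllable counter is a crude vowel-group heuristic, and the Flesch-Kincaid grade-level formula is a widely used index brought in here only for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate: count runs of vowels (a heuristic, not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def complexity_summary(text: str) -> dict:
    """Return simple surface measures of complexity for one item or passage."""
    # Rough sentence split; abbreviations such as "e.g." will be over-split.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    # Flesch-Kincaid grade level, using the standard published coefficients.
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return {
        "sentences": len(sentences),
        "words": len(words),
        "words_per_sentence": words_per_sentence,
        "syllables_per_word": syllables_per_word,
        "fk_grade": grade,
    }


summary = complexity_summary("The cat sat. The dog ran fast.")
print(summary["sentences"], summary["words"], summary["words_per_sentence"])
```

Counts of this kind do not capture clauses, passive voice, idioms, or words with multiple meanings, so they can supplement expert review of an item but not replace it.
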
Does the item...

Allow changes to its format without changing its meaning or difficulty (including visual or memory load)

• Allows for the use of braille or other tactile format. 

• Allows for signing to a student. 

• Allows for the use of oral presentation to a student. 

• Allows for the use of assistive technology. 

• Allows for translation into another language. 

Notes: 

a. Not all items can be brailled or made tactile in a meaningful way. Under such circumstances, 
items may need to be modified so that students with visual impairments can answer equivalent 
items testing knowledge of the same content. 

b. Validity may be compromised when critical and relevant words/concepts cannot be translated 
into different languages. 



Overall test considerations: 



Does the test... 

Have an overall appearance that is clean and organized 

• All visuals (e.g., images, pictures) and text provide information necessary to respond to the item. 

• Information is organized in a manner consistent with an academic English framework with a
left-right, top-bottom flow.

• Booklets/materials can be easily handled with limited motor coordination. 

• Response formats are easily matched to question. 

• Place for student to take notes (on the screen for CBT) or extra white space with paper-pencil. 

Notes: 

a. Images, pictures, and text that may not be necessary include sidebars, overlays, callout boxes, 
visual crowding, shading, and general busyness — anything that may distract a student. 

b. Carefully consider whether students from some groups may misinterpret the flow of text and 
graphics based on characteristics of their native language or culture. Left-right and top-bottom 
flow is cultural. For example, in some languages, text may flow top to bottom. 

c. When using “authentic visuals” (e.g., a map), carefully consider what is being tested in addition 
to the intended content. 

d. If a test has a time limit, careful consideration should be given to why a time limit is necessary 
for the content being tested. 

e. Readability indices are now found in major word processors; however, it is important to check
that they are working as intended.

f. Check to see which tests allow oral presentation. 

g. Involve members of the major language groups in item review committees. 

h. Consideration of these issues prior to administration of a test will also help with the 
administration of oral interpretations in the native language, if allowed on a content test. This 
issue is also relevant for sign language interpretations of tests, when appropriate. 

i. There are many ways to translate content area assessments, such as side-by-side, or 
developing parallel forms. Carefully consider the plusses and minuses of each way prior to 
making a decision about your state test. There is no perfect solution. 

j. Test items are piloted, field tested, and normed on all subgroups for which the measure is 
designed. 

k. For computerized tests, students would need to have critical keyboard and navigation skills. 
In addition to the other considerations, a computer-based test should have these
considerations: 



Layout and design 

• Sufficient contrast between background and text and graphics for easy readability. 

• Color alone is not relied on to convey important information or distinctions. 

• Font size and color scheme can be easily modified (through browser settings, style sheets, or 
on-screen options). 

• Stimulus and response options are viewable on one screen when possible. 

• Page layout is consistent throughout the test. 

• Computer interfaces follow Section 508 guidelines (www.section508.gov). 

Notes: 

a. The design of the color of the test needs to take into account test takers with color blindness 
including red/green distinctions. 

b. The electronic format needs to be accessible through the specific assistive technology the 
student has experience using during the testing. The latest technology may not be what is used 
in the schools. 

c. More recent and specific computer accessibility guidelines are available at www.aph.org, 
Microsoft, WCAG etc. 
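
The "sufficient contrast" consideration above can be checked numerically. The sketch below is one possible approach, assuming colors are given as 8-bit sRGB triples; it uses the contrast ratio definition from WCAG, one of the guideline sources named in note c. The function names are hypothetical, and the 4.5:1 threshold is WCAG's level AA value for normal-size text, not a figure stated in this report.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple) -> float:
    """Relative luminance of an (R, G, B) color with channels in 0-255."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: tuple, background: tuple) -> float:
    """WCAG contrast ratio, from 1 (equal luminance) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: tuple, background: tuple) -> bool:
    # 4.5:1 is the WCAG AA threshold for normal-size text (an outside
    # assumption; the report gives no numeric threshold).
    return contrast_ratio(foreground, background) >= 4.5


print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # 21.0
print(passes_aa_normal_text((255, 0, 0), (0, 255, 0)))       # False
```

A ratio near 1 indicates two colors of nearly equal luminance, which is the situation the color considerations warn against. A passing ratio does not show that an item avoids relying on color alone, so that consideration still needs a separate review.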

Navigation 

• Students have received adequate training on use of test delivery system. 

• Navigation is clear and intuitive; it makes sense and is easy to figure out. 

• Navigation and response selection are possible by mouse click or keyboard.

• Option to return to items and return to place in test after breaks. 

Notes: 

a. Flow to navigate and navigation symbols should be intuitive and/or explained at the beginning of 
a test. 

b. The screen resolution varies on different computers. Reviewers should check out items on 
different types of computers commonly used in schools. 

c. Schools need reasonable minimum standards for computer and audio requirements for a test. 

d. Test administration instructions should include standardized settings for the computer. 

e. Students need practice opportunities before taking computer-based tests. 

f. Some listening tests may want to limit the number of times a student can listen to a recording, 
depending on standards being tested. 

Screen reader considerations 

• Item is intelligible when read by a text/screen reader. 

• Links make sense when read out of visual context (“go to the next question” rather than “click 
here”). 

• Non-text elements have a text equivalent or description. 

• Tables are only used to contain data, and make sense when read by screen reader. 

Notes: 

a. Images and animations have text labels, if this does not supply the answer.

b. Captioning and transcripts of audio and video are available. 

c. Provide titles and summaries for tables and graphs. 

d. Header cells for columns and/or rows are designated. 

e. Information in tables makes sense when linearized (i.e., read top left to bottom right cell).

f. Current screen reader technology might be difficult for ELLs to understand; real voice
technology may be needed.
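
Two of the screen reader considerations above (text equivalents for non-text elements, and link text that makes sense out of context) lend themselves to an automated first pass. The sketch below assumes, purely for illustration, that test items are delivered as HTML; the class name, function name, and the list of vague link phrases are hypothetical. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Illustrative list of link phrases that carry no meaning out of context.
VAGUE_LINK_TEXT = {"click here", "here", "more", "link"}


class ItemAudit(HTMLParser):
    """Collect possible screen reader problems in one item's HTML."""

    def __init__(self):
        super().__init__()
        self.problems = []
        self._in_link = False
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and not (attributes.get("alt") or "").strip():
            # Flags every image lacking alt text, including decorative ones.
            self.problems.append("image without text equivalent: %s" % attributes.get("src"))
        elif tag == "a":
            self._in_link = True
            self._link_text = []

    def handle_data(self, data):
        if self._in_link:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = " ".join("".join(self._link_text).split()).lower()
            if not text or text in VAGUE_LINK_TEXT:
                self.problems.append("link text unclear out of context: %r" % text)
            self._in_link = False


def audit_item(html: str) -> list:
    parser = ItemAudit()
    parser.feed(html)
    parser.close()
    return parser.problems


sample = (
    '<a href="q2">click here</a><img src="graph.png">'
    '<a href="q3">Go to the next question</a>'
    '<img src="map.png" alt="Map of the river valley">'
)
for problem in audit_item(sample):
    print(problem)
```

A check like this only finds missing or vague text. Whether a label is intelligible, or whether it gives away the answer (note a), remains a judgment for item reviewers and for trials with the screen readers students actually use.
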
Test specific options

• Access to other functions is restricted (e.g., e-mail, Internet, instant messaging). 

• Pop up translations and definitions of key words/phrases are available if appropriate to the test. 

• Students writing online can get feedback on length of writing on-demand in cases where there is 
a restriction on number of words. 

• Students are able to record their responses and read them back as an alternative to a human 
scribe. 

• Students are allowed to create persistent marks to the extent that they are already allowed to on 
paper-based booklets (e.g., marking items for review; eliminating multiple choice items, etc.). 

Notes: 

a. Access to spell check might also be limited depending on the test. 

b. Variable audio speed might be useful to some students if it does not interfere with the standard 
being tested. 

c. The option for feedback on demand on the length of student writing would depend on the extent 
that keeping text length within some parameter is part of the construct being measured. 

Computer capabilities 

• Adjustable volume. 

• Speech recognition available (to convert user’s speech to text). 

• Test is compatible with current screen reader software. 

• Computer-based option to mask items or text (e.g., split screen). 

• Computer software for test delivery is designed to be amenable to assistive technology. 

Notes: 

a. An alternate version of the computer interface is provided that is amenable to use with screen
readers (e.g., JAWS, Window-Eyes).
Appendix

Supporting Statements by Researchers 



Measures what it intends to measure 

• Test development begins with a careful consideration of the skills proposed for measurement 
(Popham & Lindheim, 1980). 

• Every item should reflect specified content and mental behaviors, as called for in test 
specifications (Haladyna, Downing, & Rodriguez, 2002). 

• Removal of construct irrelevant variance increases test scores for students with reading
difficulties (Calhoun, Fuchs & Hamlett, 2000; Harker & Feldt, 1993; Koretz, 1997; Tindal, Heath, 
Hollenbeck, Almond & Harniss, 1998). 

• Language in non-language arts assessments needs to be “transparent” enough to students to 
clearly determine construct being measured (Sharrocks-Taylor & Hargreaves, 1999). 

Respects the diversity of the assessment population 

• Items must be reviewed for bias that may exist against particular populations (National Research 
Council, 1999). 

• Items that are designed from the start with equity and accessibility features are less likely to be 
biased against particular populations (Kopriva, 2000). 

• Items must be free of content that makes a student’s socioeconomic status or inherited 
academic aptitudes the dominant influence on how a student will respond to the item (Popham, 
2001).

• Items must be free of content that may unfairly benefit or penalize students from diverse ethnic, 
socioeconomic, or linguistic backgrounds, or students with disabilities (Popham, 2001).

• Cultural norms, beliefs, and customs need to be respectfully reflected in illustrations (Schiffman,
1995).

Has a clear format for text 

• The point sizes most often used are 10 and 12 point for documents to be read by people with 
excellent vision reading in good light (Gaster & Clark, 1995). 

• Fourteen point type increases readability and can increase test scores for both students with 
and without disabilities, compared to 12-point type (Fuchs, Fuchs, Eaton, Hamlett, Binkley, & 
Crouch, 2000). 

• Type size for captions, footnotes, keys, and legends needs to be at least 12 point (Arditi, 1999). 

• Larger type sizes are most effective for young students who are learning to read and for 
students with visual difficulties (Hoener, Salend, & Kay, 1997). 

• Large print is beneficial for reducing eye fatigue (Arditi, 1999). 

• Shapes of letters and numbers should enable people to read text “quickly, effortlessly, and with 
understanding” (Schriver, 1997). 

• The relationship between readability and point size is also dependent on the typeface used 
(Gaster & Clark, 1995; Worden, 1991). 

• Letters that are too close together are difficult for partially sighted readers. Spacing needs to be 
wide between both letters and words (Gaster & Clark, 1995). 

• Fixed-space fonts seem to be more legible for some readers than proportional-spaced fonts 
(Gaster & Clark, 1995). 

• Leading should be 25-30 percent of the point (font) size for maximum readability (Arditi, 1999). 

• Leading alone does not make a difference in readability as much as the interaction between 
point size, leading, and line length (Worden, 1991). 
• Standard typeface, upper and lower case, is more readable than italic, slanted, small caps, or all
caps (Tinker, 1963). 

• Text printed completely in capital letters is less legible than text printed completely in lower-case, 
or normal mixed-case text (Carter, Dey, & Meggs, 1985).

• Italic is far less legible and is read considerably more slowly than regular lower case (Worden, 
1991). 

• Boldface is more visible than lower case if a change from the norm is needed (Hartley, 1985).

• Staggered right margins are easier to see and scan than uniform or block style right justified 
margins (Arditi, 1999; Grise, Beattie, & Algozzine, 1982; Menlove & Hammond, 1998). 

• Justified text is more difficult to read than unjustified text — especially for poor readers (Gregory 
& Poulton, 1970; Zachrisson, 1965). 

• Justified text is also more disruptive for good readers (Muncer, Gorman, Gorman, & Bibel,
1986).

• A flush left/ragged right margin is the most effective format for text memory (Thompson, 1991).

• Unjustified text may be easier for poorer readers to understand because the uneven eye
movements created in justified text can interrupt reading (Gregory & Poulton, 1970; Hartley,
1985; Muncer, Gorman, Gorman, & Bibel, 1986; Schriver, 1997).

• Justified lines require the distances between words to be varied. In very narrow columns, not 
only are there extra wide spaces between words, but also between letters within the words 
(Gregory & Poulton, 1970). 

• Longer lines, in general, require larger type and more leading (Schriver, 1997). 

• Optimal length is 24 picas — about 4 inches (Worden, 1991). 

• Lines that are too long make readers weary and may also cause difficulty in locating the 
beginning of the next line, causing readers to lose their place (Schriver, 1997; Tinker, 1963). 

• Lines of text should be about 40-70 characters, or roughly eight to twelve words per line 
(Heines, 1984; Osborne, 2001; Schriver, 1997). 

• Blank space anchors text on the paper and helps increase legibility (Menlove & Hammond,
1998; Smith & McCombs, 1971).

• A general rule is to allow text to occupy only about half of a page. Too many test items per page 
can make items difficult to read (Tinker, 1963). 
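
Some of the numeric guidance cited above (leading of 25-30 percent of point size, lines of about 40-70 characters, a 24-pica line) can be turned into simple layout checks. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the text is assumed to be available with its line breaks as laid out, and the thresholds are taken directly from the statements cited in this list instead of from any formal standard.

```python
def leading_range(point_size: float) -> tuple:
    """Leading of 25-30 percent of the point size (Arditi, 1999, as cited above)."""
    return (0.25 * point_size, 0.30 * point_size)


def picas_to_inches(picas: float) -> float:
    """Convert picas to inches (6 picas per inch)."""
    return picas / 6.0


def lines_outside_range(text: str, low: int = 40, high: int = 70) -> list:
    """Return (line_number, length) for non-blank lines outside low..high characters."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        length = len(line.strip())
        if length and not (low <= length <= high):
            flagged.append((number, length))
    return flagged


low_leading, high_leading = leading_range(14)
print(round(low_leading, 2), round(high_leading, 2))  # 3.5 4.2 (points)
print(picas_to_inches(24))                            # 4.0 (inches)
print(lines_outside_range("short line\n" + "x" * 50 + "\n" + "y" * 80))
```

The last line of a paragraph is usually short and will be flagged by the length check, so flagged lines are prompts for a reviewer to look at the layout, not automatic failures. As noted above, readability depends on the interaction between point size, leading, and line length, so no single check is sufficient on its own.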

Has clear pictures and graphics (when essential to item) 

• Graphics with a clear sense of unity, a clear focal point, and balance reduce the cognitive load of
perceiving information and thus increase the speed with which the user can access graphic material
(Szabo & Kanuka, 1998). Computer-based tests should also allow students to change the size of the
font (see computer-specific considerations below).

• If illustrations are present, they are best when they convey essential information, acceptable
when they support the information, and unnecessary when they are unrelated to the construct or
item (Sharrocks-Taylor & Hargreaves, 1999).

• Illustrations should be placed directly next to the information to which they refer (Silver, 1994;
West, 1997).

• Placing labels directly on plot lines of graphs reduces the load on short-term memory (Gregory & 
Poulton, 1970). 

• Quantitative displays should be structured so that readers can easily construct appropriate 
inferences about the data (Schriver, 1997). 

• Graphs, illustrations, and other graphic aids can facilitate comprehension (Rakow & Gee, 1987).
Has concise and readable text

• General readability principles such as fewer words per sentence and the removal of irrelevant
difficult words increase comprehension of items (Popham & Lindheim, 1980; Rakow & Gee,
1987).

• Flow of sentences is also an important feature. Caution should be taken when reducing reading 
load so that sentences do not become disjointed or incomprehensible (Anderson, Hiebert, Scott, 
& Wilkinson, 1985). 

• Compound sentences can be written in two separate sentences (if sentences are still 
comprehensible) (Gaster & Clarke, 1995). 

• Most important ideas should be stated first in a sentence (Gaster & Clarke, 1995). 

• Noun-pronoun relationships should be clear (Gaster & Clarke, 1995). 

• Illustrations should be placed close to the text they support (Gaster & Clarke, 1995), or removed 
if they do not support text. 

• Readability increases when students have likely had experiences or prior knowledge relating to 
items (Rakow & Gee, 1987). 

• Content within items is clearly organized (Rakow & Gee, 1987).

• The content of every item should be independent from content of other items on the test
(Haladyna et al., 2002).

• Questions are clearly framed (Rakow & Gee, 1987).

• Limit the number of words, difficulty of words (Popham & Lindheim, 1980), and grammatical
complexity of test materials (Popham & Lindheim, 1980).

• Keep vocabulary simple for the group of students being tested (Haladyna et al., 2000). 

• Minimize the amount of reading in each item (Haladyna et al., 2002). 

• Avoid window dressing (excessive verbiage; Haladyna et al., 2002). 

• Simple, clear, commonly used words should be used whenever possible (Gaster & Clarke,
1995).

• Technical terms should be defined (Gaster & Clarke, 1995). 

• One idea, fact, or process should be introduced at a time, then ideas developed logically (Gaster 
& Clarke, 1995). 

• If time and setting are important to the sentence, they should be placed at the beginning of the 
sentence (Gaster & Clarke, 1995). 

• Sequence steps of instructions in the exact order that they will be needed (Gaster & Clarke, 
1995). 

• Vocabulary should be grade-level appropriate (Rakow & Gee, 1987). 

• Sentence complexity must be appropriate for grade level (Rakow & Gee, 1987). 

• Definitions and examples must be clear and understandable (Rakow & Gee, 1987). 

• Required reading skills are appropriate for students’ cognitive level (Rakow & Gee, 1987). 

• Use of plain language: “text-based language that is straightforward, concise, and uses everyday 
words to convey meaning. The goal of plain language editing strategies is to improve the 
comprehensibility of written text while preserving the essence of its message.” (Hanson, Hayes, 
Schriver, LeMahieu, & Brown, 1998, p.2). 

• Reduce the verbal and organizational complexity of test items while preserving their essential 
content (i.e., the skills and concepts they were intended to measure) (Hanson et al., 1998, p.2).

• Reduce excessive length; reduce wordiness and remove irrelevant material (Brown, 1999). 

• Eliminate unusual or low frequency words and replace with common words (e.g., replace “utilize” 
with “use”) (Brown, 1999). 

• Avoid ambiguous words (e.g., crane) (Brown, 1999). 

• Avoid irregularly spelled words (e.g., trough, feign) (Brown, 1999). 

• Avoid proper names, replace with common names or no names at all (Brown, 1999). 
Allows changes to its format without changing its meaning or difficulty (including visual or memory load)

• Construct irrelevant graphs, vertical text, untranslatable material, and decorative graphics all 
create situations where accommodating students who use braille, American Sign Language, or 
non-English languages is difficult. 

Additional considerations for computer-based assessments 

• Students reported difficulties with computers including excessive need for forward and back 
buttons, unfamiliarity with response mechanisms, and an inability to see entire problems on 
screens (Trotter, 2001). 

• Students may not be familiar with skills like scrolling or using text on multiple screens (Cole, 
Tindal, & Glasgow, 2000). 

• Some students have had little access to computers and calculators prior to testing (Bridgeman, 
Harvey, & Braswell, 1995; MacArthur & Graham, 1987).
Considerations for Universally Designed Assessment Items